

and the second of the second o





The Patent Office Concept House Cardiff Road Newport South Wales NP10 8QQ

I, the undersigned, being an officer duly authorised in accordance with Section 74(1) and (4) of the Deregulation & Contracting Out Act 1994, to sign and issue certificates on behalf of the Comptroller-General, hereby certify that annexed hereto is a true copy of the documents as originally filed in connection with the patent application identified therein.

In accordance with the Patents (Companies Re-registration) Rules 1982, if a company named in this certificate and any accompanying documents has re-registered under the Companies Act 1980 with the same name as that with which it was registered immediately before re-registration save for the substitution as, or inclusion as, the last part of the name of the words "public limited company" or their equivalents in Welsh, references to the name of the company in this certificate and any accompanying documents shall be treated as references to the name with which it is so re-registered.

accordance with the rules, the words "public limited company" may be replaced by p.l.c., p.L.C. or PLC.

Registration under the Companies Act does not constitute a new legal entity but merely success the company to certain additional company law rules.

Signed

ed the servers

Dated

18 March 2004

THIS PAGE BLANK (USPTO)

Patents Form 1/77

8. Is a statement of inventorship and

support of this application?

of right to grant a patent required in

2 2 APR 2003

THE PATENT OFFICE

Request for grant of a patent 2 2 APR 2003

NEWPORT

The Patent Office

Cardiff Road Newport NP9 1RH

|    | •                                                                                                        | INTAAL CITT                                                                                                                      |                                                     |
|----|----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|
| 1. | Your Reference                                                                                           | IMR/CEE/Y808                                                                                                                     | 22APR03 E801531-1 D02846<br>P01/7700 0.00-0309056.0 |
| 2. | Application number                                                                                       | 0309056.                                                                                                                         | 0                                                   |
| 3. | Full name, address and postcode of the or each Applicant  Country/state of incorporation (if applicable) | Transitive Technologies Limited Suite 2B The Triangle Hanging Ditch Manchester M4 3TR  82/357/50 Incorporated in: United Kingdom |                                                     |
| 4. | Title of the invention                                                                                   | Block Translation Optimizations for Program Code<br>Conversion                                                                   |                                                     |
| 5. | Name of agent                                                                                            | APPLEYARD LEES                                                                                                                   |                                                     |
|    | Address for service in the UK to which all correspondence should be sent                                 | 15 CLARE ROAD<br>HALIFAX<br>HX1 2HY                                                                                              |                                                     |
|    | Patents ADP number                                                                                       | 190001                                                                                                                           | ·                                                   |
| 6. | Priority claimed to:                                                                                     | Country Appli                                                                                                                    | cation number Date of filing                        |
| 7. | Divisional status claimed from:                                                                          | Number of parent application                                                                                                     | Date of filing                                      |
|    |                                                                                                          |                                                                                                                                  |                                                     |

YES

| 9. Enter the number of sheets for any of the following items you are filing with this form. Do not count copies of the same document |                                                                                     |                |  |
|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|----------------|--|
| Continuation sheets of this form                                                                                                     | ·                                                                                   |                |  |
| Description                                                                                                                          | 59                                                                                  |                |  |
| Claim(s)                                                                                                                             | 6 Q                                                                                 |                |  |
| Abstract                                                                                                                             | 1                                                                                   |                |  |
| Drawing(s)                                                                                                                           | 8 6                                                                                 |                |  |
| <ol> <li>If you are also filing any of the<br/>following, state how many against<br/>each item</li> </ol>                            |                                                                                     |                |  |
| Priority documents                                                                                                                   | -                                                                                   |                |  |
| Translation of priority documents                                                                                                    | -                                                                                   |                |  |
| Statement of inventorship and right to grant a patent (PF 7/77)                                                                      | -                                                                                   |                |  |
| Request for a preliminary examination and search (PF 9/77)                                                                           | -                                                                                   |                |  |
| Request for substantive examination (PF 10/77)                                                                                       | -                                                                                   |                |  |
| Any other documents (please specify)                                                                                                 | -                                                                                   |                |  |
| 11.                                                                                                                                  | We request the grant of a patent on the basis of this application.  Signature  Date |                |  |
|                                                                                                                                      | APPLEYARD LEES                                                                      | 17 April 2003  |  |
|                                                                                                                                      | Sypholler                                                                           |                |  |
| 2. Contact                                                                                                                           | Ian Robinson- 01422 330110                                                          | - 01422 330110 |  |

# BLOCK TRANSLATION OPTIMIZATIONS FOR PROGRAM CODE CONVERSION

The subject invention relates generally to the field of computers and computer software and, more particularly, to program code conversion methods and apparatus useful, for example, in code translators, emulators and accelerators.

In both embedded and non-embedded CPU's, one finds 10 predominant Instruction Set Architectures (ISAs) for which large bodies of software exist that could be "accelerated" for performance, or "translated" to a myriad of capable processors that could present better cost/performance benefits, provided that they could transparently access 15 One also finds dominant the relevant software. architectures that are locked in time to their ISA, cannot evolve in performance or market reach. Such architectures would benefit from "Synthetic CPU" coarchitecture. 20

Program code conversion methods and apparatus facilitate such acceleration, translation and coarchitecture capabilities and are addressed, for example, in WO 99/03168 entitled Program Code Conversion.

25

According to the present invention there is provided an apparatus and method as set forth in the appended claims. Preferred features of the invention will be apparent from the dependent claims, and the description which follows.

(

5

20

The following is a summary of various aspects and advantages realizable according to various embodiments according to the invention. It is provided as an introduction to assist those skilled in the art to more rapidly assimilate the detailed design discussion that ensues and does not and is not intended in any way to limit the scope of the claims that are appended hereto.

In particular, the inventors have developed a number of optimization techniques directed at expediting program code conversion, particularly useful in connection with a run-time translator which employs translation of successive basic blocks of subject program code into target code wherein the target code corresponding to a first basic block is executed prior to generation of target code for the next basic block.

In one such optimization referred to as "extended basic blocks," the translator detects whether the starting address of a successor basic block is statically determinable, and, if so, performs target code generation for both the initial basic block and its successor together before execution of the target code.

In another process referred to as "group blocking", the translator applies a selection algorithm to identify a group of previously translated basic blocks for retranslation as a contiguous whole. Such an algorithm may compare a profiling metric stored in a basic block data structure with a profiling threshold, for example, to determine when to initiate group block creation and to determine which basic blocks to include in a group block. In its simplest case, the profiling metric may be

execution count, i.e., the number of times a particular basic block has been executed. The blocks selected for the group are preferably ordered and subjected to further optimizations, including global dead code elimination and global register allocation, prior to target code generation.

A particularly advantageous embodiment provides a cached basic block data structure which facilitates implementation of extended basic blocks and group blocking, together with isoblocking and cached translation state.

For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example, to the accompanying diagrammatic drawings in which:

Figure 1 is a block diagram of apparatus wherein 20 embodiments of the invention find application;

Figure 2 is a schematic diagram illustrating a runtime translation process and corresponding IR (intermediate representation) generated during the process;

Figure 3 is a schematic illustrating a basic block data structure and cache according to an illustrative embodiment of the invention;

30

25

(

5

10

Figure 4 is a flow diagram illustrating an extended basic block process;

Figure 5 is a flow diagram illustrating isoblocking;

Figure 6 is a flow diagram illustrating group blocking and attendant optimizations; and

Figure 7 is a schematic diagram of an example illustrating group block optimization.

Figure 8 is a flow diagram illustrating run-time 10 translation, including extended basic blocking, isoblocking, and group blocking.

5

15

20

25

30

Illustrative apparatus for implementing various novel features discussed below is shown in Figure 1. Figure 1. illustrates а target processor 13 including target registers 15 together with memory 18 storing a number of software components 19, 20, 21, and providing working storage 16 including a basic block cache 23, a global register store 27, and the subject code 17 translated. The software components include an operating system 20, the translator code 19, and translated code 21. The translator code 19 may function, for example, as an emulator translating subject of code one ISA into translated code of another ISA or as an accelerator for translating subject code into translated code, each of the same ISA.

The translator 19, i.e., the compiled version of the source code implementing the translator, and the translated code 21, i.e., the translation of the subject code 17 produced by the translator 19, run in conjunction with the operating system 20 such as, for example, UNIX running on the target processor 13, typically a

microprocessor or other suitable computer. It will be appreciated that the structure illustrated in Figure 1 is exemplary only and that, for example, software, methods according to the invention mav processes residing within or beneath an implemented in code operating system. The subject code, translator code, operating system, and storage mechanisms may be any of a wide variety of types, as known to those skilled in the art.

10

15

20

25

30

(

In apparatus according to Figure 1, program code conversion is preferably performed dynamically, at runtime, while the translated code 21 is running. The translator 19 runs inline with the translated program 21. The execution path of the translation process is a control loop comprising the steps of: executing translator code 19, which translates a block of the subject code 17 into translated code 21, and then executing that block of translated code; the end of each block of translated code instructions to return control back to the contains the steps of translator code 19. In other words, translating and then executing the subject code interlaced, such that only portions of the subject program 17 are translated at a time and the translated code of a first basic block is executed prior to the translation of The translator's fundamental subsequent basic blocks. unit of translation is the basic block, meaning that the translator 19 translates the subject code 17 one basic block at a time. A basic block is formally defined as a section of code with exactly one entry point and exactly one exit point, which limits the block code to a single For this reason, basic blocks are the control path. fundamental unit of control flow.

In the process of generating the translated code 21, intermediate representation ("IR") trees are generated based on the subject instruction sequence. IR trees are abstract representations of the expressions calculated and operations performed by the subject program. Later, translated code 21 is generated based on the IR trees.

The collections of IR nodes described herein are 10 colloquially referred to as "trees". We note that, formally, such structures are in fact directed acyclic graphs (DAGs), not trees. The formal definition of a tree requires that each node have at most one parent. embodiments described use common subexpression elimination during IR generation, nodes will often have 15 multiple parents. For example, the IR of a flag-affecting instruction result may be referred to by two abstract registers, those corresponding to the destination subject register and the flag result parameter.

20

25

30

For example, the subject instruction "add %r1, %r2, %r3" performs the addition of the contents of subject registers %r2 and %r3 and stores the result in subject Thus, this instruction corresponds to the register %r1. abstract expression "%r1 = %r2 + %r3". This example contains a definition of the abstract register %r1 with an add expression containing two subexpressions representing the instruction operands %r2 and %r3. In the context of a subject program 17, these subexpressions may correspond to other, prior subject instructions, or they may represent details of the current instruction such as immediate constant values.

10

15

20

25

30

When the "add" instruction is parsed, a new "+" the corresponding to abstract generated, **"+**" IR node mathematical operator for addition. The stores references to other IR nodes that represent the operands (represented in the IR as subexpression trees, often held in subject registers). The "+" node is itself referenced by the subject register whose value it defines register for %r1, the instruction's abstract For example, the center-right destination register). portion of Figure 20 shows the IR tree corresponding to the X86 instruction "add %ecx, %edx".

As those skilled in the art may appreciate, in one embodiment the translator 19 is implemented using object-oriented programming language such as C++. example, an IR node is implemented as a C++ object, and other nodes are implemented references to references to the C++ objects corresponding to those other is therefore implemented tree An IR objects, containing various collection οf IR node references to each other.

Further, in the embodiment under discussion, IR generation uses a set of abstract registers. These abstract registers correspond to specific features of the subject architecture. For example, there is a unique abstract register for each physical register on the subject architecture ("subject register"). Similarly, there is a unique abstract register for each condition code flag present on the subject architecture. Abstract registers serve as placeholders for IR trees during IR generation. For example, the value of subject register %r2 at a given point in the subject instruction sequence

is represented by a particular IR expression tree, which is associated with the abstract register for subject register %r2. In one embodiment, an abstract register is implemented as a C++ object, which is associated with a particular IR tree via a C++ reference to the root node object of that tree.

In the example instruction sequence described above, the translator has already generated IR trees 10 corresponding to the values of %r2 and %r3 while parsing the subject that instructions precede the "add" instruction. In other words, the subexpressions that calculate the values of %r2 and %r3 are represented as IR trees. When generating the IR tree for the "add %r1, %r2, %r3" instruction, the new "+" node 15 contains references to the IR subtrees for %r2 and %r3. The implementation of the abstract registers is divided between components in both the translator code 19 and the translated code 21. Within the translator 19, "abstract register" is a placeholder used in the course of 20 generation, such that the abstract register associated with the IR tree that calculates the value of the subject register to which the particular abstract register corresponds. As such, abstract registers in the 25 translator may be implemented as a C++ object which contains a reference to an IR node object (i.e., an IR The aggregate of all IR trees referred to by the abstract register set is referred to as the working IR forest ("forest" because it contains multiple abstract 30 register roots, each of which refers to an IR tree). working IR forest represents a snapshot of the abstract operations of the subject program at a particular point in the subject code.

(

10

15

20

25

30

Within the translated code 21, an "abstract register" is a specific location within the global register store, to and from which subject register values are synchronized with the actual target registers. Alternatively, when a value has been loaded from the global register store, an abstract register in the translated code 21 could be understood to be a target register 15, which temporarily holds a subject register value during the execution of the translated code 21, prior to being saved back to the register store.

An example of program translation as described above illustrated in Figure 2. Figure 2 shows translation of two basic blocks of x86 instructions, and the corresponding IR trees that are generated in The left side of Figure 2 shows process of translation. translator 19 execution path of the translation. In step 151, the translator 19 translates a first basic block 153 of subject code into target code 21 and then, in step 155, executes that target code 21. When the target code 21 finishes execution, control is returned to the translator 19, step 157, wherein the translator translates the next basic block 159 of subject code 17 into target code 21 and then executes that target code 21, step 161, and so on.

In the course of translating the first basic block 153 of subject code into target code, the translator 19 generates an IR tree 163 based on that basic block 153. In this case, the IR tree 163 is generated from the source instruction "add %ecx, %edx," which is a flag-affecting instruction. In the course of generating the IR tree 163,

four abstract registers are defined by this instruction: the destination abstract register %ecx 167, the first flag-affecting instruction parameter 169, the second flag-affecting instruction parameter 171, and the flag-affecting instruction result 173. The IR tree corresponding to the "add" instruction is a "+" operator 175 (i.e., arithmetic addition), whose operands are the subject registers %ecx 177 and %edx 179.

Thus, emulation of the first basic block 153 puts the flags in a pending state by storing the parameters and result of the flag-affecting instruction. The flagaffecting instruction is "add %ecx, %edx." The parameters of the instruction are the current values of emulated subject registers %ecx 177 and %edx 179. The "@" symbol preceding the subject register uses 177, 179 indicate that the values of the subject registers are retrieved from the global register store, from the locations corresponding to %ecx and %edx, respectively, as these particular subject registers were not previously loaded by the current basic block. These parameter values are then stored in the first and second flag parameter abstract registers 169, The result of the addition operation 175 is stored in the flag result abstract register 173.

25

30

5

10

15

20

After the IR tree is generated, the corresponding target code 21 is generated based on the IR. The process of generating target code 21 from a generic IR is well understood in the art. Target code is inserted at the end of the translated block to save the abstract registers, including those for the flag result 173 and the flag parameters 169, 171, to the global register store 27.

(

10

15

20

25

30

After the target code is generated, it is then executed, step 155.

Figure 2 shows an example of translation and execution interlaced. The translator 19 first generates translated code 21 based on the subject instructions 17 of a first basic block 153, then the translated code for basic block 153 is executed. At the end of the first basic block 153, the translated code 21 returns control to the translator 19, which then translates a second basic block 159. The translated code 21 for the second basic block 161 is then executed. At the end of the execution of the second basic block 159, the translated code returns control to the translator 19, which then translates the next basic block, and so forth.

Thus, a subject program running under the translator 19 has two different types of code that execute in an interleaved manner: the translator code 19 and the translated code 21. The translator code 19 is generated by a compiler, prior to run-time, based on the high-level source code implementation of the translator 19. The translated code 21 is generated by the translator code 19, throughout run-time, based on the subject code 17 of the program being translated.

The representation of the subject processor state is likewise divided between the translator 19 and translated code 21 components. The translator 19 stores subject processor state in a variety of explicit programming language devices such as variables and/or objects; the compiler used to compile the translator determines how the state and operations are implemented in target code. The

translated code 21, by comparison, stores subject processor state implicitly in target registers and memory locations, which are manipulated directly by the target instructions of the translated code 21.

For example, the low-level representation of the global register store 27 is simply a region of allocated memory. This is how the translated code 21 sees and interacts with the abstract registers, by saving and restoring between the defined memory region and various target registers. In the source code of the translator 19, however, the global register store 27 is a data array or an object which can be accessed and manipulated at a higher level. With respect to the translated code 21, there simply is no high-level representation.

In some cases, subject processor state which is static or statically determinable in the translator 19 is encoded directly into the translated code 21 rather than being calculated dynamically. For example, the translator 19 may generate translated code 21 that is specialized on the instruction type of the last flag-affecting instruction, meaning that the translator would generate different target code for the same basic block if the instruction type of the last flag-affecting instruction changed.

The translator 19 contains data structures corresponding to each basic block translation, which particularly facilitates extended basic block, isoblock, group block, and cached translation state optimizations as hereafter described. Figure 3 illustrates such a basic block data structure 30, which includes a subject address 31, a target code pointer 33 (i.e., the target address of

1

10

15

20

25

30

the translated code), translation hints 34, entry and exit conditions 35, a profiling metric 37, references to the data structures of the predecessor and successor basic blocks 38, 39, and an entry register map 40. Figure 3 further illustrates the basic block cache 23, which is a collection of basic block data structures, e.g., 30, 41, 42, 43, 44 . . . indexed by subject address. embodiment, the data corresponding to particular a translated basic block may be stored in a C++ object. translator creates a new basic block object as the basic block is translated.

The subject address 31 of the basic block is the starting address of that basic block in the memory space of the subject program 17, meaning the memory location where the basic block would be located if the subject program 17 were running on the subject architecture. This is also referred to as the subject starting address. While each basic block corresponds to a range of subject addresses (one for each subject instruction), the subject starting address is the subject address of the first instruction in the basic block.

The target address 33 of the basic block is the memory location (starting address) of the translated code 21 in the target program. The target address 33 is also referred to as the target code pointer, or the target starting address. To execute a translated block, the translator 19 treats the target address as a function pointer which is dereferenced to invoke (transfer control to) the translated code.

The basic block data structures 30, 41, 42, 43, . . . are stored in the basic block cache 23, which is a

repository of basic block objects organized by subject address. When the translated code of a basic block finishes executing, it returns control to the translator and also returns the value of the basic block's destination (successor) subject address 31 to translator. To determine if the successor basic block has already been translated, the translator 19 compares the destination subject address 31 against the addresses 31 of basic blocks in the basic block cache 23 (i.e., those that have already been translated). 10 blocks which have not been yet translated are translated and then executed. Basic blocks which have already been translated (and which have compatible entry conditions, as discussed below) are simply executed. Over time, many of the basic blocks encountered will already have been 15 translated, which causes the incremental translation cost to decrease. As such, the translator 19 gets faster over time, as fewer and fewer blocks require translation.

## 20 Extended Basic Blocks

25

30

One optimization applied according to the illustrative embodiment is to increase the scope of code generation by a technique referred to as "extended basic blocks." In cases where a basic block A has only one successor block (e.g., basic block B), the translator may be able to statically determine (when A is decoded) the subject address of B. In such cases, basic blocks A and B are combined into a single block (A') which is referred to as an extended basic block. Put differently, the extended basic block mechanism can be applied to unconditional jumps whose destination is statically determinable; if a jump is conditional or if the destination cannot be

statically determined, then a separate basic block must be formed. An extended basic block may still formally be a basic block, because after the intervening jump from A to B is removed, the code of block A' has only a single flow of control, and therefore no synchronization is necessary at the AB boundary.

Even if A has multiple possible successors including B, extended basic blocks may be used to extend A into B for a particular execution in which B is the actual successor and B's address is statically determinable.

10

15

20

25

30

Statically determinable addresses are those the decode-time. During determine at translator can forest, an IR construction of a block's IR constructed for the destination subject address, which is associated with the destination address abstract register. If the value of destination address IR tree is statically determinable (i.e., does not depend on dynamic or run-time subject register values), then the successor block is statically determinable. For example, in the case of an unconditional jump instruction, the destination address (i.e., the subject starting address of the successor block) is implicit in the jump instruction itself; the subject address of the jump instruction plus the offset encoded in the jump instruction equals the destination address. Likewise, the optimizations of constant folding (e.g.,  $X + (2 + 3) \Rightarrow X + 5$ ) and expression folding (e.g., (X \* 5) \* 10 => X \* 50) may cause an otherwise "dynamic" destination address to become statically determinable. The calculation of the destination address thus consists of extracting the constant value from the destination address IR.

When extended basic block A' is created, the translator subsequently treats it the same as any other basic block when performing IR generation, optimizations, and code generation. Because the code generation algorithms are operating on a larger scope (i.e., the code of basic blocks A and B combined), the translator 19 generates more optimal code.

As one of ordinary skill in the art will appreciate, 10 decoding is the process of extracting individual subject instructions from the subject code. The subject code is stored as an unformatted byte stream (i.e., a collection of bytes in memory). In the case of subject architectures with variable-length instructions (e.g., X86), decoding 15 identification of instruction requires the boundaries; in the case of fixed-length instruction identifying instruction boundaries architectures, the MIPS, every four bytes (e.g., on an trivial The subject instruction format is then 20 instruction). applied to the bytes that constitute a given instruction to extract the instruction data (i.e., the instruction type, operand register numbers, immediate field values, and any other information encoded in the instruction). The process of decoding machine instructions of a known 25 architecture from an unformatted byte stream using that architecture's instruction format is well understood in the art.

30 Figure 4 illustrates the creation of an extended basic block. A set of constituent basic blocks which is eligible to become an extended basic block is detected when the earliest eligible basic block (A) is decoded. If

is the translator 19 detects that A's successor (B) statically determinable 51, it calculates B's starting address 53 and then resumes the decoding process at the starting address of B. If B's successor (C) is determined to be statically determinable 55, the decoding process proceeds to the starting address of C, and so forth. block is statically not successor if а determinable then normal translation and execution resume 61, 63, 65.

10

15

20

25

30

5

During all basic block decoding, the working IR forest includes an IR tree to calculate the subject address 31 of the current block's successor (i.e., the destination subject address; the translator has a dedicated abstract register for the destination address). In the case of an extended basic block, to compensate for the fact that intervening jumps are being eliminated, each new as constituent basic block is assimilated by the decoding process, the IR tree for the calculation of that block's subject address is pruned 54 (Figure 4). In other words, when the translator 19 statically calculates B's address and decoding resumes at B's starting address, the IR tree corresponding to the dynamic calculation of B's subject (which was constructed in the course of address 31 decoding A) is pruned; when decoding proceeds to starting address of C, the IR tree corresponding to C's subject address is pruned 59; and so forth. "Pruning" an IR tree means to remove any IR nodes which are depended on by the destination address abstract register and by no other abstract registers. Put differently, pruning breaks the link between the IR tree and the destination abstract register; any other links to the same IR tree remain In some cases, a pruned IR tree may also be unaffected.

depended on by another abstract register, in which case the IR tree remains to preserve the subject program's execution semantics.

To prevent code explosion (traditionally, the mitigating factor against such code specialization techniques), the translator limits extended basic blocks to some maximum number of subject instructions. In one embodiment, extended basic blocks are limited to a maximum of 200 subject instructions.

#### Isoblocks

5

10

Another optimization implemented in the illustrated embodiment is so-called "isoblocking." According to this 15 technique, translations of basic blocks are parameterized, or specialized, on a compatibility list, which is a set of variable conditions that describe the subject processor The compatibility list is state and the translator state. different for each subject architecture, to take into 20 account different architectural features. The actual values of the compatibility conditions at the entry and exit of a particular basic block translation are referred to as entry conditions and exit conditions, respectively. If execution reaches a basic block which has already been 25 translated but the previous translation's entry conditions differ from the current working conditions (i.e., the exit conditions of the previous block), then the basic block must be translated again, this time based on the current working conditions. The result is that the same subject code basic block is now represented by multiple target code translations. These different translations of the same basic block are referred to as isoblocks.

To support isoblocks, the data associated with each basic block translation includes one set of entry conditions 35 and one set of exit conditions 36 (Figure 3). In one embodiment, the basic block cache 23 is organized first by subject address 31 and then by entry conditions 35, 36 (Figure 3). In another embodiment, when the translator queries the basic block cache 23 for a subject address 31, the query may return multiple translated basic blocks (isoblocks).

10

15

20

25

30

Figure 5 illustrates the use of isoblocks. At the end of a first translated block's execution, the translated code 21 calculates and returns the subject address of the next block (i.e., the successor) 71. Control is then returned to the translator 19, as demarcated by dashed line 73. In the translator 19, the basic block cache 23 is queried using the returned subject address 31, step 75. The basic block cache may return zero, one, or more than one basic block data structures with the same subject address 31. If the basic block cache 23 returns zero data structures (meaning that this basic block has not yet been translated), then the basic block must be translated, step 77, by the translator 19. Each data structure returned by the basic block cache 23 corresponds to a different translation (isoblock) of the same basic block of subject As illustrated at decision diamond 79, if the code. current exit conditions (of the first translated block) do match the entry conditions of any of the data structures returned by the basic block cache 23, then the basic block must be translated again, step 81, this time parameterized on those exit conditions. If the current exit conditions match the entry conditions of one of the data structures returned by the basic block cache 23, then that translation is compatible and can be executed without re-translation, step 83. In the illustrative embodiment, the translator 19 executes the compatible translated block by dereferencing the target address as a function pointer. As noted above, basic block translations are preferably parameterized on a compatibility list. Exemplary compatibility lists will now be described for both the X86 and PowerPC architectures.

10

15

An illustrative compatibility list for the X86 architecture includes representations of: (1) lazy propagation of subject registers; (2) overlapping abstract registers; (3) type of pending condition code flagaffecting instruction; (4) lazy propagation of condition code flag-affecting instruction parameters; (5) direction of string copy operations; (6) floating point unit (FPU) mode of the subject processor; and (7) modifications of the segment registers.

20

25

30

The compatibility list for the X86 architecture includes representations of any lazy propagation subject registers by the translator, also referred to as register aliasing. Register aliasing occurs when the translator knows that two subject registers contain the same value at a basic block boundary. As long as the subject register values remain the same, only one of the corresponding abstract registers is synchronized, saving it to the global register store. Until the saved subject register is overwritten, references to the nonsaved register simply use or copy (via a move instruction) the saved register. This avoids two memory accesses (save + restore) in the translated code.

The compatibility list for the X86 architecture includes representations of which of the overlapping abstract registers are currently defined. In some cases, the subject architecture contains multiple overlapping subject registers which the translator represents using multiple overlapping abstract registers. For example, variable-width subject registers are represented using multiple overlapping abstract registers, one for each access size. For example, the X86 "EAX" register can be accessed using any of the following subject registers, each of which has a corresponding abstract register: EAX (bits 31...0), AX (bits 15...0), AH (bits 15...8), and AL (bits 7...0).

15

20

25

30

10

The compatibility list for the X86 architecture includes representations of, for each integer and floating point condition code flag, whether the flag value is normalized or pending, and if pending the type of the pending flag-affecting instruction.

the X86 architecture The compatibility list for register aliasing of representations includes condition code flag-affecting instruction parameters (if some subject register still holds the value of a flagaffecting instruction parameter, or if the value of the The same as the first). second parameter is the representations οf compatibility list includes also whether the second parameter is a small constant (i.e., an immediate instruction candidate), and if so its value.

The compatibility list for the X86 architecture includes a representation of the current direction of

string copy operations in the subject program. This condition field indicates whether string copy operations move upward or downward in memory. This supports code specialization of "strcpy()" function calls, by parameterizing translations on the function's direction argument.

The compatibility list for the X86 architecture includes a representation of the FPU mode of the subject processor. The FPU mode indicates whether subject floating-point instructions are operating in 32- or 64-bit mode.

compatibility list for the X86 architecture 15 includes a representation of modifications of the segment All X86 instruction memory references are based on one of six memory segment registers: CS (code segment), DS (data segment), SS (stack segment), ES (extra segment), (general purpose FS segment), 20 (general purpose segment). Under normal circumstances an application will not modify the segment registers. such, code generation is by default specialized on the assumption that the segment register values constant. It is possible, however, for a program to 25 modify its segment registers, in which case the corresponding segment register compatibility bit will be causing the translator to generate generalized memory accesses using the appropriate segment register's dynamic value.

30

(

10

An illustrative embodiment of a compatibility list for the PowerPC architecture includes representations of: (1) mangled registers; (2) link value propagation; (3) type of pending condition code flag-affecting instruction; (4) lazy propagation of condition code flag-affecting instruction parameters; (5) condition code flag value aliasing; and (6) summary overflow flag synchronization state.

5

10

15

20

25

The compatibility list for the PowerPC architecture includes a representation of mangled registers. In cases where the subject code contains multiple consecutive memory accesses using a subject register for the base translator may translate those memory address, the accesses using a mangled target register. In cases where subject program data is not located at the same address in target memory as it would have been in subject memory, the translator must include a target offset in every memory address calculated by the subject code. While the subject register contains the subject base address, a mangled target register contains the target address corresponding to that subject base address (i.e., subject base address + With register mangling, memory accesses target offset). can be translated more efficiently by applying the subject code offsets directly to the target base address, stored By comparison, without the in the mangled register. mangled register mechanism this scenario would require additional manipulation of the target code for each memory access, at the cost of both space and execution time. compatibility list indicates which abstract registers if any are mangled.

The compatibility list for the PowerPC architecture includes a representation of link value propagation. For leaf functions (i.e., functions that call no other functions), the function body may be extended (as with the

extended basic block mechanism discussed above) into the call/return site. Hence, the function body and the code follows the function's return are translated together. This is also referred to as function return specialization, because such a translation includes code and is therefore specialized on, the function's return site. Whether a particular block translation used link value propagation is reflected in the conditions. As such, when the translator encounters a block whose translation used link value propagation, must evaluate whether the current return site will be the same as the previous return site. Functions return to the same location from which they are called, so the call site and return site are effectively the same (offset by one or two instructions). The translator can therefore determine whether the return sites are the same by comparing the respective call sites; this is equivalent to comparing the subject addresses of the respective predecessor blocks (of the function block's prior and current executions). such, in embodiments that support link value propagation, the data associated with each basic block translation includes a reference to the predecessor block translation (or some other representation of the predecessor block's subject address).

25

30

10

15

20

The compatibility list for the PowerPC architecture includes representations of, for each integer and floating point condition code flag, whether the flag value is normalized or pending, and if pending the type of the pending flag-affecting instruction.

The compatibility list for the PowerPC architecture includes representations of register aliasing for flag-affecting instruction parameters (if flag-affecting

instruction parameter values happen to be live in a subject register, or if the value of the second parameter is the same as the first). The compatibility list also includes representations of whether the second parameter is a small constant (i.e., an immediate instruction candidate), and if so its value.

5

10

15

20

25

30

The compatibility list for the PowerPC architecture includes representations of register aliasing for the The PowerPC PowerPC condition code flag values. architecture includes instructions for explicitly loading the entire set of PowerPC flags into a general purpose (subject) register. This explicit representation of the subject flag values in subject registers interferes with emulation code flag the translator's condition list contains compatibility optimizations. The representation of whether the flag values are live in a subject register, and if so which register. generation, references to such a subject register while it holds the flag values are translated into references to the corresponding abstract registers. This mechanism eliminates the need to explicitly calculate and store the subject flag values in a target register, which in turn allows the translator to apply the standard condition code flag optimizations.

The compatibility list for the PowerPC architecture includes a representation of summary overflow synchronization. This field indicates which of the eight summary overflow condition bits are current with the global summary overflow bit. When one of the PowerPC's eight condition fields is updated, if the global summary

overflow is set, it is copied to the corresponding summary overflow bit in the particular condition code field.

### Translation Hints

5

10

15

20

25

Another optimization implemented in the illustrative embodiment employs the translation hints 34 of the basic block data structure of Figure 3. This optimization proceeds from a recognition that there is static basic block data which is specific to a particular basic block, but which is the same for every translation of that block. For some types of static data which are expensive to calculate, it is more efficient for the translator to calculate the data once, during the first translation of the corresponding block, and then store the result for future translations of the same block. Because this data is the same for every translation of the same block, it does not parameterize translation and therefore it is not formally part of the block's compatibility list (discussed Expensive static data is still stored in the data associated with each basic block translation, however, as it is cheaper to save the data than it is to recalculate. In later translations of the same block, even if the translator 19 cannot reuse a prior translation, translator 19 can take advantage of these "translation (i.e., the cached static data) to reduce the translation cost of the second and later translations.

In one embodiment, the data associated with each basic block translation includes translation hints, which are calculated once during the first translation of that block and then copied (or referred to) on each subsequent translation.

For example, in a translator 19 implemented in C++, translation hints may be implemented as a C++ object, in which case the basic block objects which correspond to different translations of the same block would each store a reference to the same translation hints object. Alternatively, in a translator implemented in C++, the basic block cache 23 may contain one basic block object per subject basic block (rather than per translation), with each such object containing or holding a reference to the corresponding translation hints; such basic block objects also contain multiple references to translation objects that correspond to different translations of that block, organized by entry conditions.

15

20

25

10

Exemplary translation hints for the X86 architecture (1) initial instruction include representations of: (2) and initial repeat prefixes. prefixes; translation hints for the X86 architecture particularly include a representation of how many prefixes the first instruction in the block has. Some X86 instructions have prefixes which modify the operation of the instruction. This architectural feature makes it difficult expensive) to decode an X86 instruction stream. Once the number of initial prefixes is determined during the first decoding of the block, that value is then stored by the translator 19 as a translation hint, so that subsequent translations of the same bock do not need to determine it anew.

30

The translation hints for the X86 architecture further include a representation of whether the first instruction in the block has a repeat prefix. Some X86 instructions

such as string operations have a repeat prefix which tells the processor to execute that instruction multiple times. The translation hints indicate whether such a prefix is present, and if so its value.

5

10

In one embodiment, the translation hints associated with each basic block additionally include the entire IR forest corresponding to that basic block. This effectively caches all of the decoding and IR generation performed by the frontend. In another embodiment, the translation hints include the IR forest as it exists prior to being optimized. In another embodiment, the IR forest is not cached as a translation hint, in order to conserve the memory resources of the translated program.

15

20

## Group Blocks

Another optimization implemented in the illustrative translator embodiment is directed to eliminating program overhead resulting from the necessity to synchronize all abstract registers at the end of execution of each translated basic block. This optimization is referred to as group block optimization.

As discussed above, in basic block mode (e.g., Figure 2), state is passed from one basic block to the next using a memory region which is accessible to all translated code sequences, namely, a global register store 27. The global register store 27 is a repository for abstract registers, each of which corresponds to and emulates the value of a particular subject register or other subject architectural feature. During the execution of translated code 21, abstract registers are held in target registers so that

they may participate in instructions. During the execution of translator code 21, abstract register values are stored in the global register store 27 or target registers 15.

5

10

15

20

25

30

in basic block mode such as illustrated in Figure 2, all abstract registers must be synchronized at (1) control the end of each basic block for two reasons: returns to the translator code 19, which potentially overwrites all target registers; and (2) because code generation only sees one basic block at a time, the translator 19 must assume that all abstract registers values are live (i.e., will be used in subsequent basic blocks) and therefore must be saved. The goal of the reduce is to block optimization mechanism synchronization across basic block boundaries that are crossed frequently, by translating multiple basic blocks By translating multiple basic as a contiguous whole. blocks together, the synchronization at block boundaries can be minimized if not eliminated.

Group block construction is triggered when the current block's profiling metric reaches a trigger threshold. block. the trigger block is referred to as This Construction can be separated into the following steps (Figure 6): (1) selecting member blocks 71; (2) ordering member blocks 73; (3) global dead code elimination 75; (4) global register allocation 77; and (5) code generation 79. The first step 71 identifies the set of blocks that are to be included in the group block by performing a depth-first (DFS) traversal of the program's control graph, beginning with the trigger block and tempered by an The inclusion threshold and a maximum member limit.

second step 73 orders the set of blocks and identifies the critical path through the group block, to enable efficient code layout that minimizes synchronization code and reduces branches. The third and fourth steps 75, 77 perform optimizations. The final step 79 generates target code for all member blocks in turn, producing efficient code layout with efficient register allocation.

10

15

20

25

30

In construction of a group block and generation of target code therefrom, the translator code 19 implements the steps illustrated in Figure 6. When the translator 19 encounters a basic block that was previously translated, prior to executing that block, the translator 19 checks the block's profiling metric 37 (Figure 3) against the The translator 19 begins group block trigger threshold. creation when a basic block's profiling metric 37 exceeds the trigger threshold. The translator 19 identifies the members of the group block by a traversal of the control flow graph, starting with the trigger block and tempered by the inclusion threshold and maximum member limit. Next, the translator 19 creates an ordering of the member blocks, which identifies the critical path through the group block. The translator 19 then performs global dead code elimination; the translator 19 gathers register liveness information for each member block, using the IR Next, the translator corresponding to each block. performs global register allocation according architecture-specific policy, which defines a partial set uniform register mappings for all member blocks. Finally, the translator 19 generates target code for each member block in order, consistent with the global register allocation constraints and using the register liveness analyses.

As noted above, the data associated with each basic block includes a profiling metric 37. In one embodiment, the profiling metric 37 is execution count, meaning that the translator 19 counts the number of times a particular basic block has been executed; in this embodiment, the profiling metric 37 is represented as an integer count In another embodiment, the profiling field (counter). metric 37 is execution time, meaning that the translator 19 keeps a running aggregate of the execution time for all 10 executions of a particular basic block, such as by planting code in the beginning and end of a basic block to start and stop, respectively, a hardware or software timer; in this embodiment, the profiling metric 37 uses some representation of the aggregate execution 15 In another embodiment, the translator 19 stores multiple types of profiling metrics 37 for each basic In another embodiment, the translator 19 stores multiple sets of profiling metrics 37 for each basic block, corresponding to each predecessor basic 20 and/or each successor basic block, such that distinct profiling data is maintained for different control paths. (i.e., the execution translator cycle each translator code 19 between executions of translated code 21), the profiling metric 37 for the appropriate basic 25 block is updated.

In embodiments that support group blocks, the data associated with each basic block additionally includes references 38, 39 to the basic block objects of known references in successors. These predecessors and control-flow graph of all aggregate constitute a previously executed basic blocks. During group block

30

formation, the translator 19 traverses this control-flow graph to determine which basic blocks to include in the group block under formation.

Group block formation in the illustrative embodiment 5 a trigger threshold, is based on three thresholds: inclusion threshold, and a maximum member limit. trigger threshold and the inclusion threshold refer to the profiling metric 37 for each basic block. translator cycle, the profiling metric 37 of the next 10 basic block is compared to the trigger threshold. If the metric 37 meets the trigger threshold then group block The inclusion threshold is then used to formation begins. determine the scope of the group block, by identifying which successor basic blocks to include in the group 15 The maximum member limit defines the upper limit on the number of basic blocks to be included in any one group block.

When the trigger threshold is reached for basic block 20 a new group block is formed with A as the trigger The translator 19 then begins the definition traversal, a traversal of A's successors in the controlflow graph to identify other member blocks to include. When traversal reaches a given basic block, its profiling 25 metric 37 is compared to the inclusion threshold. metric 37 meets the inclusion threshold, that basic block is marked for inclusion and the traversal continues to the block's successors. If the block's metric 37 is below the excluded and inclusion threshold, that block is 30 When traversal ends (i.e., successors are not traversed. all paths either reach an excluded block or cycle back to maximum member included block, or the limit is an

reached), the translator 19 constructs a new group block based on all of the included basic blocks.

In embodiments that use isoblocks and group blocks, the control flow graph is a graph of isoblocks, meaning that different isoblocks of the same subject block are treated as different blocks for the purposes of group block creation. Thus, the profiling metrics for different isoblocks of the same subject block are not aggregated.

10

15

20

25

30

In another embodiment, isoblocks are not used in basic block translation but are used in group block translation, meaning that non-group basic block translations generalized (not specialized on entry conditions). In this embodiment, a basic block's profiling metric disaggregated by the entry conditions of each execution, such that distinct profiling information is maintained for each theoretical isoblock (i.e., for each distinct set of In this embodiment, the entry conditions). associated with each basic block includes a profiling list, each member of which is a three-item set containing: a set of entry conditions, (2) a corresponding of corresponding list profiling metric, and (3) a This data maintains profiling and successor blocks. control path information for each set of entry conditions to the basic block, even though the actual basic block translation is not specialized on those entry condition. In this embodiment, the trigger threshold is compared to each profiling metric within a basic block's profiling metric list. When the control flow graph is traversed, each element in a given basic block's profiling list is treated as a separate node in the control flow graph. inclusion threshold is therefore compared against each profiling metric in the block's profiling list. In this embodiment, group blocks are created for particular hot isoblocks (specialized to particular entry conditions) of hot subject blocks, but other isoblocks of those same subject blocks are executed using the general (non-isoblock) translations of those blocks.

After the definition traversal, the translator performs an ordering traversal, step 73; Figure 6, to determine the order in which member blocks will translated. The order of the member blocks affects both the instruction cache behavior of the translated code 21 (hot paths should be contiguous) and the synchronization necessary on member block boundaries (synchronization should be minimized along hot paths). In one embodiment, the translator 19 performs the ordering traversal using an ordered depth-first search (DFS) algorithm, ordered by execution count. Traversal starts at the member block having the highest execution count. If a traversed member block has multiple successors, the successor with the higher execution count is traversed first.

10

15

20

25

30

One of ordinary skill in the art will appreciate that group blocks are not formal basic blocks, as they may have internal control branches, multiple entry points, and/or multiple exit points.

Once a group block has been formed, a further optimization may be applied to it, referred to herein as "global dead code elimination." Such global dead code elimination employs the technique of liveness analysis. Global dead code elimination is the process of removing redundant work from the IR across a group of basic blocks.

Generally, subject processor state must be synchronized on translation scope boundaries. A value, such as a subject register, is said to be "live" for the range of code starting with its definition and ending with its last use hence, prior to being re-defined (overwritten); analysis of values' (e.g., temporary values in the context of IR generation, target registers in the context of code generation, or subject registers in the context translation) uses and definitions is known in the art as Whatever knowledge (i.e., liveness liveness analysis. analysis) the translator has regarding the uses (reads) and definitions (writes) of data and state is limited to its translation scope; the rest of the program is an More specifically, because the translator does unknown. not know which subject registers will be used outside the scope of translation (e.g., in a successor basic block), it must assume that all registers will be used. the values (definitions) of any subject registers which were modified within a given basic block must be saved (stored to the global register store 27) at the end of that basic block, against the possibility of their future Likewise, all subject registers whose values will be used in a given basic block must be restored (loaded from the global register store 27) at the beginning of that basic block; i.e., the translated code for a basic block must restore a given subject register prior to its first use within that basic block.

10

15

20

25

The general mechanism of IR generation involves an implicit form of "local" dead code elimination, whose scope is localized to only a small group of IR nodes at once. For example, a common subexpression A in the subject code would be represented by a single IR tree for

with multiple parent nodes, rather than multiple instances of the expression tree Α itself. The "elimination" is implicit in the fact that one IR node can have links to multiple parent nodes. Likewise, the use of abstract registers as IR placeholders is an implicit form of dead code elimination. If the subject code for a given basic block never defines a particular subject register, then at the end of IR generation for that block, the abstract register corresponding to that subject register will refer to an empty IR tree. The code generation phase recognizes that, in this scenario, the appropriate abstract register need not be synchronized with the global register store. As such, local dead code elimination is implicit in the IR generation phase, occurring incrementally as IR nodes are created.

10

15

20

25

30

In contrast to local dead code elimination, a "global" dead code elimination algorithm is applied to a basic block's entire IR expression forest. Global dead code elimination according to the illustrative embodiment requires liveness analysis, meaning analysis of subject register uses (reads) and subject register definitions (writes) within the scope of each basic block in a group block, to identify live and dead regions. The IR is transformed to remove dead regions and thereby reduce the amount of work that must be performed by the target code. For example, at a given point in the subject code, if the translator 19 recognizes or detects that a particular subject register will be defined (overwritten) before its next use, the subject register is said to be dead at all points in the code up to that preempting definition. terms of the IR, subject registers which are defined but never used before being re-defined are dead code which can be eliminated in the IR phase without ever spawning target code. In terms of target code generation, target registers which are dead can be used for other temporary or subject register values without spilling.

5

10

15

20

25

30

In group block global dead code elimination, liveness analysis is performed on all member blocks. Liveness analysis generates the IR forest for each member block, which is then used to derive the subject register liveness information for that block. IR forests for each member block are also needed in the code generation phase of group block creation. Once the IR for each member block is generated in liveness analysis, it can either be saved for subsequent use in code generation, or it can be deleted and re-generated during code generation.

code elimination dead block qlobal Group effectively "transform" the IR in two ways. First, the IR forest generated for each member block during liveness analysis can be modified, and then that entire IR forest can be propagated to (i.e., saved and reused during) the the IR this scenario, phase; in code generation transformations are propagated through the code generation phase by applying them directly to the IR forest and then saving the transformed IR forest. In this scenario, the data associated with each member block includes liveness information (to be additionally used in global register allocation), and the transformed IR forest for that block. Alternatively and preferably, the step of global dead code elimination which transforms the IR for a member block is performed during the final code generation phase of group information block creation, using liveness In this embodiment, the global dead code earlier.

transformations can be recorded as list of "dead" subject is then the liveness which encoded in information associated with each member block. transformation of the IR forest is thus performed by the subsequent code generation phase, which uses the dead register list to prune the IR forest. This scenario allows the translator to generate the IR once during liveness analysis, then throw the IR away, and then regenerate the same IR during the code generation, at which point the IR is transformed using the liveness analysis (i.e., global dead code elimination is applied to the IR In this scenario, the data associated with each itself). member block includes liveness information, which includes a list of dead subject registers. The IR forest is not saved. Specifically, after the IR forest is (re)generated in the code generation phase, the IR trees for dead subject registers (which are listed in the dead subject register list within the liveness information) are pruned. In one embodiment, the IR created during liveness analysis liveness information thrown away after the extracted, to conserve memory resources. The IR forests block) are recreated during code per member generation, one member block at a time. In this embodiment, the IR forests for all member blocks do not coexist at any point in translation. However, the two versions of the IR forests, created during liveness analysis and code generation, respectively, are identical, as they are generated from the subject code using the same IR generation process.

30

10

15

20

25

In another embodiment, the translator creates an IR forest for each member block during liveness analysis, and then saves the IR forest, in the data associated with each

member block, to be reused during code generation. In this embodiment, the IR forests for all member blocks coexist, from the end of liveness analysis (in the global dead code elimination step) to code generation. In one alternative of this embodiment, no transformations or optimizations are performed on the IR during the period from its initial creation (during liveness analysis) and its last use (code generation).

In another embodiment, the IR forests for all member 10 blocks are saved between the steps of liveness analysis and code generation, and inter-block optimizations are performed on the IR forests prior to code generation. this embodiment, the translator takes advantage of the fact that all member block IR forests coexist at the same 15 point in translation, and optimizations are performed across the IR forests of different member blocks which transform those IR forests. In this case, the IR forests used in code generation may not be identical to the IR in the forests used in liveness analysis (as 20 embodiments described above), because the IR forests have by inter-block subsequently transformed In other words, the IR forests used in optimizations. code generation may be different than the IR forests that would result from generating them anew one member block at 25 a time.

In group block global dead code elimination, the scope of dead code detection is increased by the fact that liveness analysis is applied to multiple blocks at the same time. Hence, if a subject register is defined in the first member block, and then redefined in the third member block (with no intervening uses or exit points), the IR

30

tree for the first definition can be eliminated from the first member block. By comparison, under basic block code generation, the translator 19 would be unable to detect that this subject register was dead.

5

10

15

20

25

30

As noted above, one goal of group block optimization is to reduce or eliminate the need for register synchronization at basic block boundaries. Accordingly, a discussion of how register allocation and synchronization is achieved by the translator 19 during group blocking is now provided.

Register allocation is the process of associating an (subject) register with a target register. abstract Register allocation is a necessary component of code generation, as abstract register values must reside in target registers to participate in target instructions. The representation of these allocations (i.e., mappings) abstract registers registers and target During code generation, referred to as a register map. the translator 19 maintains a working register map, which reflects the current state of register allocation (i.e., target-to-abstract register mappings actually existence at a given point in the target code). will be had hereafter to an exit register map which is, abstractly, a snapshot of the working register map on exit from a member block. However, since the exit register map is not needed for synchronization, it is not recorded so it is purely abstract. The entry register map 40 (Figure 3) is a snapshot of the working register map on entry to a is necessary to member block, which synchronization purposes.

Also, as discussed above, a group block contains multiple member blocks, and code generation is performed separately for each member block. As such, each member block has its own entry register map 40 and exit register map, which reflect the allocation of particular target registers to particular subject registers at the beginning and end, respectively, of the translated code for that block.

Code generation for a group member block is 10 parameterized by its entry register map 40 (the working register map on entry), but code generation also modifies The exit register map for a the working register map. member block reflects the working register map at the end of that block, as modified by the code generation process. 15 When the first member block is translated, the working register register is empty (subject to global map allocation, discussed below). At the end of translation for the first member block, the working register map contains the register mappings created by the code 20 The working register map is then generation process. copied into the entry register maps 40 of all successor member blocks.

At the end of code generation for a member block, some 25 require synchronization. may not abstract registers allow the translator 19 to maps Register synchronization on member block boundaries, by identifying which registers actually require synchronization. comparison, in the (non-group) basic block scenario all 30 abstract registers must be synchronized at the end of every basic block.

At the end of a member block, three synchronization scenarios are possible based on the successor. the successor is a member block which has not yet been translated, its entry register map 40 is defined to be the as the working register map, with the consequence that no synchronization is necessary. Second, successor block is external to the group, then all abstract registers must be synchronized (i.e., a full synchronization) because control will return to the translator code 19 before the successor's execution. Third, if the successor block is a member block whose register map has already been fixed, then synchronization code must be inserted to reconcile the working map with the successor's entry map.

15

20

25

10

Some of the cost of register map synchronization is reduced by the group block ordering traversal, which minimizes register synchronization or eliminates entirely along hot paths. Member blocks are translated in the order generated by the ordering traversal. member block is translated, its exit register map propagated into the entry register map 40 of all successor member blocks whose entry register maps are not yet fixed. In effect, the hottest path in the group translated first, and most if not all member block boundaries along that path require no synchronization because the corresponding register maps are all consistent.

For example, the boundary between the first and second member blocks will always require no synchronization, because the second member block will always have its entry register map 40 fixed to be the same as the exit register

map 41 of the first member block. Some synchronization between member blocks may be unavoidable because group blocks can contain internal control branches and multiple entry points. This means that execution may reach the same member block from different predecessors, with different working register maps at different times. These cases require that the translator 19 synchronize the working register map with the appropriate member block's entry register map.

10

15

20

25

If required, register map synchronization occurs on The translator 19 inserts code member block boundaries. at the end of a member block to synchronize the working register map with the successor's entry register map 40. In register map synchronization, each abstract register falls under one of ten synchronization conditions. 1 illustrates the ten register synchronization cases as a function of the translator's working register map and the successor's entry register map 40. Table 2 describes the register synchronization algorithm, by enumerating the ten formal synchronization cases with text descriptions of the cases and pseudo-code descriptions of the corresponding synchronization actions (the pseudo-code is explained Thus, at every member block boundary, every below). register is synchronized using the 10-case This detailed articulation of synchronization algorithm. allows the translator conditions actions and generate efficient synchronization code, which minimizes the synchronization cost for each abstract register.

30

The following describes the synchronization action functions listed in Table 2. "Spill(E(a))" saves abstract register a from target register E(a) into the subject

register bank (a component of the global register store). "Fill(t,a)" loads abstract register a from the subject register bank into target register t. "Reallocate()" moves and reallocates (i.e., changes the mapping of) an abstract register to a new target register if available, or spills the abstract register if a target register is not available. "FreeNoSpill(t)" marks a target register as free without spilling the associated abstract subject The FreeNoSpill() function is necessary to avoid superfluous spilling across multiple applications of the algorithm at the same synchronization point. that for cases with a "Nil" synchronization action, no synchronization code is necessary for the corresponding abstract registers.

15

10

| LEGEND     |                                                               |  |  |  |  |  |
|------------|---------------------------------------------------------------|--|--|--|--|--|
| a          | abstract subject register                                     |  |  |  |  |  |
| t          | target register                                               |  |  |  |  |  |
| W          | <pre>working register map {W(a) =&gt; t}</pre>                |  |  |  |  |  |
| E          | entry register map {E(a) => t}                                |  |  |  |  |  |
| dom        | domain ·                                                      |  |  |  |  |  |
| rng        | range                                                         |  |  |  |  |  |
| € .        | is a member of                                                |  |  |  |  |  |
| ∉          | is not a member of                                            |  |  |  |  |  |
| W(a) ∉ rng | The working register for abstract register "a" is not in the  |  |  |  |  |  |
| E          | range of the entry register map. I.e., the target register    |  |  |  |  |  |
|            | that is currently mapped to abstract register "a" ("W(a)") is |  |  |  |  |  |
|            | not defined in the entry register map E.                      |  |  |  |  |  |

|   |   |     | a ∈ d                   | om | W   |   |  |          |          |      |    | a | ∉ | dom      |
|---|---|-----|-------------------------|----|-----|---|--|----------|----------|------|----|---|---|----------|
|   |   |     |                         |    |     |   |  | W        |          |      |    |   |   |          |
| a | € | dom | W(a) ∉ rng W(a) ∈ rng E |    |     | 2 |  |          |          |      |    |   |   |          |
| E |   |     |                         |    |     | E |  |          |          |      |    |   |   |          |
|   |   |     | E(a)                    | ∉  | rng | 6 |  | 8        |          | ,    |    | 4 |   |          |
|   |   |     | W                       |    |     |   |  |          | _        |      |    |   |   | <u>.</u> |
|   |   |     | E(a)                    | €  | rng | 7 |  | <br>W(a) | <b>≠</b> | E(a) | 9  | 5 |   |          |
|   |   |     | W                       |    |     |   |  | W(a)     |          | =    | 10 |   |   |          |
|   | • | ·   |                         |    |     |   |  | E(a)     |          |      |    |   |   |          |
| a | ∉ | dom |                         |    |     | 2 |  | 3        |          |      |    | 1 |   |          |
| E |   |     |                         |    |     |   |  |          |          |      |    |   |   |          |

Table 1: Enumeration of the 10 Register Synchronization Scenarios

Table 2: Register Map Synchronization Scenarios Action Description Case Nil W (...) a ∉ (dom E ∪ E(...) dom W) The abstract register is neither in the working rmap or the entry rmap. Spill(W(a)) W(a=>t1,...)  $a \in dom W$ E(...) The abstract register is in the working a ∉ dom E the entry rmap. rmap, but not in Furthermore the target register used in W(a) ∉ rng E the working rmap is not in the range of the entry rmap. Spill(W(a)) W(al=>t1,...) a ∈ dom W E(ax=>t1,...) The abstract register is in the working, a ∉ dom E but not in the entry rmap. However the target register used in the working rmap W(a) ∈ rng E is in the range of the entry rmap.

5

| Τε | Table 2: Register Map Synchronization Scenarios |                                           |                                       |  |  |  |  |
|----|-------------------------------------------------|-------------------------------------------|---------------------------------------|--|--|--|--|
|    | Case                                            | Description                               | Action                                |  |  |  |  |
| 4  | a ∉ dom ₩                                       | W ()                                      | Fill(E(a),                            |  |  |  |  |
|    | ^                                               | E(al=>t1,)                                | a)                                    |  |  |  |  |
|    | a ∈ dom E                                       | The abstract register is in the entry     |                                       |  |  |  |  |
|    |                                                 | rmap but not in the working rmap.         |                                       |  |  |  |  |
|    | E(a) ∉ rng W                                    | Furthermore the target register used in   |                                       |  |  |  |  |
|    | _ (,                                            | the entry rmap is not in the range of the |                                       |  |  |  |  |
|    |                                                 | working rmap.                             |                                       |  |  |  |  |
| 5  | a ∉ dom W                                       | W(ax=>t1,)                                | Reallocate(E                          |  |  |  |  |
|    | ^                                               | E(a1=>t1,)                                | (a))                                  |  |  |  |  |
|    | a ∈ dom E                                       | The abstract register is in the entry     | Fill(E(a),                            |  |  |  |  |
|    | _                                               | rmap but not in the working rmap. However | a)                                    |  |  |  |  |
|    | E(a) ∈ rng W                                    | the target register used in the entry     |                                       |  |  |  |  |
|    |                                                 | rmap is in the range of the working rmap. |                                       |  |  |  |  |
| 6  | a ∈ (dom W ∩                                    | W(al=>t1,)                                | Copy W(a) =>                          |  |  |  |  |
|    | dom E)                                          | E(a1=>t2,)                                | E(a)                                  |  |  |  |  |
|    | ^                                               | The abstract register is in the working   | FreeNoSpill(                          |  |  |  |  |
|    | W(a) ∉ rng E                                    | rmap and the entry rmap. However both     | W(a))                                 |  |  |  |  |
|    | ^                                               | use different target registers.           |                                       |  |  |  |  |
|    | E(a) ∉ rng W                                    | Furthermore the target register used in   |                                       |  |  |  |  |
|    | 2 (4) 2 2                                       | the working rmap is not in the range of   |                                       |  |  |  |  |
|    |                                                 | the entry rmap and the target register    |                                       |  |  |  |  |
|    |                                                 | used in the entry rmap is not in the      |                                       |  |  |  |  |
|    |                                                 | range of the working rmap.                |                                       |  |  |  |  |
| 7  | a ∈ (dom W ∩                                    | W(a1=>t1,ax=>t2)                          | Spill(E(a))                           |  |  |  |  |
|    | dom E)                                          | E(a1=>t2,)                                | Copy W(a) =>                          |  |  |  |  |
|    | ^                                               | The abstract register in the working rmap | E(a)                                  |  |  |  |  |
|    | W(a) ∉ rng E                                    | is in the entry rmap. However both use    | FreeNoSpill(                          |  |  |  |  |
|    | ^                                               | different target registers. The target    | W(a))                                 |  |  |  |  |
|    | E(a) ∈ rng W                                    | register used in the working rmap is not  |                                       |  |  |  |  |
|    | D(u) C Ing "                                    | in the range of the entry rmap, however   |                                       |  |  |  |  |
|    |                                                 | the target register used in the entry     |                                       |  |  |  |  |
|    |                                                 | rmap is in the range of the working rmap. |                                       |  |  |  |  |
| 8  | a ∈ (dom W ∩                                    | W(al=>t1,)                                | Copy W(a) =>                          |  |  |  |  |
|    | dom E)                                          | E(a1=>t2,ax=>t1,)                         | E(a)                                  |  |  |  |  |
|    | ^                                               | The abstract register in the working rmap | FreeNoSpill(                          |  |  |  |  |
|    | W(a) ∈ rng E                                    | is in the entry rmap. However both use    | W(a))                                 |  |  |  |  |
|    | ^                                               | different target registers. The target    |                                       |  |  |  |  |
|    | C(a) ∉ rng W                                    | register used in the entry rmap is not in |                                       |  |  |  |  |
|    | a(α) ≠ riig w                                   | the range of the working rmap, however    | , , , , , , , , , , , , , , , , , , , |  |  |  |  |
|    |                                                 | the target register used in the working   |                                       |  |  |  |  |
| J  |                                                 | rmap is in the range of the entry rmap.   |                                       |  |  |  |  |

| Ta | Table 2: Register Map Synchronization Scenarios |                                           |              |  |  |  |  |  |
|----|-------------------------------------------------|-------------------------------------------|--------------|--|--|--|--|--|
|    | Case                                            | Description                               | Action       |  |  |  |  |  |
| 9  | a ∈ (dom W ∩                                    | W(al=>t1,ax=>t2,)                         | Spill(E(a))  |  |  |  |  |  |
| }  | dom E)                                          | E(al=>t2,ay=>t1,)                         | Copy W(a) => |  |  |  |  |  |
|    | _                                               | The abstract register in the working rmap | E(a)         |  |  |  |  |  |
|    | W(a) ∈ rng E                                    | is in the entry rmap. Both use different  | FreeNoSpill( |  |  |  |  |  |
|    |                                                 | target registers. However, the target     | W(a))        |  |  |  |  |  |
|    | E(a) e rng W                                    | register used in the entry rmap is in the |              |  |  |  |  |  |
|    |                                                 | range of the working rmap, and the target |              |  |  |  |  |  |
|    | ^                                               | register used in the working rmap is in   |              |  |  |  |  |  |
|    | W(a) ≠ E(a)                                     | the range of the entry rmap.              |              |  |  |  |  |  |
| 1  | a ∈ (dom W ∩                                    | W(a1=>t1,)                                | Nil          |  |  |  |  |  |
| 0  | dom E)                                          | E(al=>t1,)                                |              |  |  |  |  |  |
|    | ^                                               | The abstract register in the working rmap |              |  |  |  |  |  |
|    | W(a) ∈ rng E                                    | is in the entry rmap. Furthermore they    |              |  |  |  |  |  |
|    |                                                 | both map to the same target register.     |              |  |  |  |  |  |
|    | E(a) e rnq W                                    |                                           |              |  |  |  |  |  |
|    |                                                 |                                           |              |  |  |  |  |  |
|    | ^                                               |                                           |              |  |  |  |  |  |
|    | W(a) = E(a)                                     |                                           |              |  |  |  |  |  |

The translator 19 performs two levels of register allocation within a group block, global and local (or temporary). Global register allocation is the definition of particular register mappings, before code generation, entire group block (i.e., which persist across an throughout all member blocks). Local register allocation consists of the register mappings created in the process of code generation. Global register allocation defines which allocation constraints register particular parameterize the code generation of member blocks, by constraining local register allocation.

10

Abstract registers that are globally allocated do not require synchronization on member block boundaries, because they are guaranteed to be allocated to the same respective target registers in every member block. This

approach has the advantage that synchronization code (which compensates for differences in register mappings between blocks) is never required for globally allocated abstract registers on member block boundaries. The disadvantage of group block register mapping is that it hinders local register allocation because the globally allocated target registers are not immediately available for new mappings. To compensate, the number of global register mappings may be limited for a particular group block.

The number and selection of actual global register allocations is defined by a global register allocation global register allocation policy policy. The configurable based on subject architecture, architecture, and applications translated. The optimal of globally allocated registers is derived empirically, and is a function of the number of target registers, the number of subject registers, the type of application being translated, and application usage patterns. The number is generally a fraction of the total number of target registers minus some small number to ensure that enough target registers remain for temporary values.

25

30

10

15

20

In cases where there are many subject registers but few target registers, such as the MIPS-X86 and PowerPC-X86 translators, the number of globally allocated registers is zero. This is because the X86 architecture has so few target registers that using any fixed register allocation has been observed to produce worse target code than none at all.

In cases where there are many subject registers and many target registers, such as the X86-MIPS translator, the number of globally allocated registers (n) is three quarters the number of target registers (T). Hence:

5

10

25

30

X86-MIPS:

n

3/4 \*

T

Even though the X86 architecture has few general purpose registers, it is treated as having many subject registers because many abstract registers are necessary to emulate the complex X86 processor state (including, e.g., condition code flags).

In cases where the number of subject registers and target registers is approximately the same, such as the MIPS-MIPS accelerator, most target registers are globally allocated with only a few reserved for temporary values. Hence:

20 MIPS-MIPS: n = T - 3

In cases where the total number of subject registers in use across the entire group block (s) is less than or equal to the number of target registers (T), all subject registers are globally mapped. This means that the entire register map is constant across all member blocks. In the special case where (s = T), meaning that the number of target registers and active subject registers is equal, this means that there are no target registers left for temporary calculations; in this case, temporary values are locally allocated to target registers that are globally allocated to subject registers that have no further uses

within the same expression tree (such information is obtained through liveness analysis).

At the end of group block creation, code generation is performed for each member block, in the traversal order. During code generation, each member block's IR forest is (re)generated and the list of dead subject registers (contained in that block's liveness information) is used to the prune the IR forest prior to generating target code. As each member block is translated, its exit register map is propagated to the entry register maps 40 of all successor member blocks (except those which have already been fixed). Because blocks are translated in traversal order, this has the effect of minimizing register map synchronization along hot paths, as well as making hot path translations contiguous in the target memory space. As with basic block translations, group member block translations are specialized on a set of entry conditions, namely the current working conditions when the group block was created.

10

15

20

Figure 7 provides an example of group block generation by the translator code 19 according to an illustrative embodiment. The example group block has five members ("A" 25 to "E"), and initially one entry point ("Entry 1"; Entry 2 generated later through aggregation, as discussed below) and three exit points ("Exit 1," "Exit 2," and "Exit In this example, the trigger threshold for group block creation is an execution count of 45000, and the inclusion threshold for member blocks is an execution 30 count of 1000. The construction of this group block was triggered when block A's execution count (now 45074) reached the trigger threshold of 45000, at which point a

search of the control flow graph was performed in order to identify the group block members. In this example, five blocks were found that exceeded the inclusion threshold of 1000. Once the member blocks are identified, an ordered depth first search (ordered by profiling metric) is performed such that hotter blocks and their successors are processed first; this produces a set of blocks with a critical path ordering.

dead code elimination 10 Αt stage global Each member block is analyzed for register performed. uses and definitions (i.e., liveness analysis). This makes code generation more efficient in two ways. local register allocation can take into account which subject registers are live in the group block (i.e., which 15 subject registers will be used in the current or successor member blocks), which helps to minimize the cost of spills; dead registers are spilled first, because they do not need to be restored. In addition, if analysis shows that a particular subject register 20 defined, used, and then redefined (overwritten), the value can be thrown away any time after the last use (i.e., its target register can be freed). If liveness analysis shows that a particular subject register value is defined and then redefined without any intervening uses (unlikely, as 25 this would mean that the subject compiler generated dead code), then the corresponding IR tree for that value can be thrown away, such that no target code is ever generated for it.

30

Global register allocation is next. The translator 19 assigns frequently accessed subject registers a fixed target register mapping which is constant across all

member blocks. Globally allocated registers are nonspillable, meaning that those target registers unavailable to local register allocation. A percentage of target registers must be kept for temporary subject register mappings when there are more subject registers than target registers. In special cases where the entire set of subject registers within the group block can fit into target registers, spills and fills are completely avoided. As illustrated in Figure 7, the translator plants code ("Pr1") to load these registers from the global register store 27 prior to entering the head of the group block ("A"); such code is referred to as prologue loads.

5

10

25

The group block is now ready for target code generation. During code generation, the translator 19 uses a working register map (the mapping between abstract registers and target registers) to keep track of register allocation. The value of the working register map at the beginning of each member block is recorded in that block's associated entry register map 40.

First the prologue block Pr1 is generated which loads the globally allocated abstract registers. At this point the working register map at the end of Pr1 is copied to the entry register map 40 of block A.

Block A is then translated, planting target code directly following the target code for Prl. Control flow code is planted to handle the exit condition for Exit 1, which consists of a dummy branch (to be patched later) to epilogue block Epl (to be planted later). At the end of block A, the working register map is copied to the entry

register map 40 of block B. This fixing of B's entry register map 40 has two consequences: first, no synchronization is necessary on the path from A to B; second, entry to B from any other block (i.e., a member block of this group block or a member block of another group block using aggregation) requires synchronization of that block's exit register map with B's entry register map.

10

15

20

25

30

Block B is next on the critical path. Its target code is planted directly following block A, and code to handle the two successors, C and A, is then planted. The first successor, block C, has not yet had its entry register map 40 fixed, so the working register map is simply copied into C's entry register map. The second successor, block A, however, has previously had its entry register map 40 fixed and therefore the working register map at the end of block B and the entry register map 40 of block A may differ. Any difference in the register maps requires some synchronization ("B-A") along the path from block B to block A in order to bring the working register map into line with the entry register map 40. This synchronization takes the form of register spills, fills, and swaps and is detailed in the ten register map synchronization scenarios above.

Block C is now translated and target code is planted directly following block C. Blocks D and E are likewise translated and planted contiguously. The path from E to A again requires register map synchronization, from E's exit register map (i.e., the working register map at the end of E's translation) to A's entry register map 40, which is planted in block "E-A."

Prior to exiting the group block and returning control to the translator 19, the globally allocated registers must be synchronized to the global register store; this code is referred to as epilogue saves. After the member blocks have been translated, code generation plants epilogue blocks for all exit points (Ep1, Ep2, and Ep3), and fixes the branch targets throughout the member blocks. In embodiments that use both isoblocks and group blocks, 10 the control flow graph traversal is made in terms of unique subject blocks (i.e., a particular basic block in the subject code) rather than isoblocks of that block. such, isoblocks are transparent to group block creation. No special distinction is made with respect to subject blocks that have one translation or multiple translations. In the illustrative embodiment, both the group block and isoblock optimizations may be advantageously employed. However, the fact that the isoblock mechanism may create different basic block translations for the same subject code sequence complicates the process of deciding which blocks to include in the group block, since the blocks to be included may not exist until the group block is formed. The information collected using the unspecialized blocks that existed prior to the optimization must be adapted before being used in the selection and layout process.

15

20

25

30

The illustrative embodiment further employs technique for accommodating features of nested loops group block generation. Group blocks are originally created with only one entry point, namely the start of the trigger block. Nested loops in a program cause the inner first, creating a loop to become hot group block representing the inner loop. Later, the outer loop

becomes hot, creating a new group block that includes all the blocks of the inner loop as well as the outer loop. If the group block generation algorithm does not take account of the work done for the inner loop, but instead re-does all of that work, then programs that contain deeply nested loops will progressively generate larger and larger group blocks, requiring more storage and more work on each group block generation. In addition, the older (inner) group blocks may become unreachable and therefore provide little or no benefit.

10

15

20

25

30

According to the illustrative embodiment, group block aggregation is used to enable a previously built group block to be combined with additional optimized blocks. During the phase in which blocks are selected inclusion in a new group block, those candidates which are already included in a previous group block are identified. Rather than planting target code for these blocks, aggregation is performed, whereby translator the creates a link to the appropriate location in the existing Because these links may jump to the middle group block. of the existing group block, the working register map enforced; location must be that corresponding to the code planted for the link includes accordingly, register map synchronization code as required.

The entry register map 40 stored in the basic block data structure 30 supports group block aggregation. Aggregation allows other translated code to jump into the middle of a group block, using the beginning of the member block as an entry point. Such entry points require that the current working register map be synchronized to the member block's entry register map 40, which the translator

19 implements by planting synchronization code (i.e., spills and fills) between the exit point of the predecessor and the entry point of the member block.

In one embodiment, some member blocks' register maps 5 are selectively deleted to conserve resources. Initially, the entry register maps of all member blocks in a group are stored indefinitely, to facilitate entry into group block (from an aggregate group block) beginning of any member block. As group blocks become 10 large, some register maps may be deleted to conserve memory. If this happens, aggregation effectively divides the group block into regions, some of which (i.e., member whose register maps have been deleted) 15 inaccessible to aggregate entry. Different policies are used to determine which register maps to store. policy is to store all register maps of all member blocks (i.e., never delete). An alternative policy is to store register maps only for the hottest member blocks. alternative policy is to store register maps only for 20 member blocks that are the destinations of backward branches (i.e., the start of a loop).

In another embodiment, the data associated with each group member block includes a recorded register map for every subject instruction location. This allows other translated code to jump into the middle of a group block at any point, not just the beginning of a member block, as, in some cases, a group member block may contain undetected entry points when the group block is formed. This technique consumes large amounts of memory, and is therefore only appropriate when memory conservation is not a concern.

Group blocking provides a mechanism for identifying of blocks or sets frequently executed blocks performing additional optimizations on them. Because more computationally expensive optimizations are applied to group blocks, their formation is preferably confined to basic blocks which are known to execute frequently. the case of group blocks, the extra computation is justified by frequent execution; contiguous blocks which are executed frequently are referred to as a "hot path." Embodiments may be configured wherein multiple levels of frequency and optimization are used, such that detects multiple tiers of frequently translator 19 increasingly complex executed basic blocks, and optimizations are applied. Alternately, and as described above only two levels of optimization are used: optimizations are applied to all basic blocks, single set of further optimizations are applied to group blocks using the group block creation mechanism described above.

## Overview

10

15

20

25

30

Figure 8 illustrates the steps performed by the translator at run-time, between executions of translated code. When a first basic block  $(BB_{N-1})$  finishes execution 1201, it returns control to the translator 1202. The translator increments the profiling metric of the first basic block 1203. The translator then queries the basic block cache 1205 for previously translated isoblocks of the current basic block  $(BB_N)$ , which is  $BB_{N-1}$ 's successor), using the subject address returned by the first basic block's execution. If the successor block has already

been translated, the basic block cache will return one or more basic block data structures. The translator then compares the successor's profiling metric to the group block trigger threshold 1207 (this may involve aggregating the profiling metrics of multiple isoblocks). If the threshold is not met, the translator then checks if any isoblocks returned by the basic block cache are compatible with the working conditions (i.e., isoblocks with entry conditions identical to the exit conditions of  $BB_{N-1}$ ). If a compatible isoblock is found, that translation is executed 1211.

10

15

20

25

If the successor profiling metric exceeds the group block trigger threshold, then a new group block is created 1213 and executed 1211, as discussed above, even if a compatible isoblock exists.

If the basic block does not return any isoblocks, or none of the isoblocks returned are compatible, then the current block is isoblock translated 1217 into an specialized on the current working conditions, discussed above. At the end of decoding  $BB_N$ , if the successor of  $BB_N$  ( $BB_{N+1}$ ) is statically determinable 1219, then an extended basic is created 1215. If an extended basic block is created, then  $BB_{N+1}$  is translated 1217, and so forth. When translation is complete, the new isoblock is stored in the basic block cache 1221 and then executed 1211.

Although a few preferred embodiments have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be

made without departing from the scope of the invention, as defined in the appended claims.

Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.

10

15

20

All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.

Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

25

30

The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

## Claims

A method of translation of a plurality of basic
 blocks of program code comprising:

decoding a first basic block;

detecting whether the subject starting address of a successor block of said first basic block is statically determinable;

if said successor basic block is statically determinable, calculating the subject starting address of said successor block; and

resuming the decoding process at the subject starting address of said successor block.

20 2. The method of claim 1, wherein said first basic block includes an unconditional jump whose subject destination address is statically determinable and wherein said decoding process is resumed at said subject destination address.

25

3. The method of claim 1 or 2, wherein if said successor block is statically determinable, an IR tree corresponding to the dynamic calculation of the starting subject address of said successor block is pruned.

30

4. The method of claim 1, 2 or 3, wherein if said successor block is not statically determinable, target

code generated for said first basic block is executed prior to translation of said successor basic block.

- 5. The method of claim 4, further including the step of synchronizing a set of abstract registers at the end of execution of said basic block and prior to decoding of said successor block.
- 6. The method of any preceding claim performed as part of a translation process including a control loop comprising the steps of:

executing translator code which translates a first block of subject code into translated code;

15

executing the translated code;

returning control to the translator code at the end of execution of the translated code; and

20

executing translator code which translates a second basic block into translated code.

7. A method of translating a plurality of basic blocks of program code comprising:

applying a selection algorithm to identify a group of previously translated basic blocks for translation as a contiguous whole; and

30

generating an intermediate representation for each basic block of said group.

8. The method of claim 7, wherein the step of applying a selection algorithm comprises checking a profiling metric of previously executed basic block against at least one profiling threshold.

5

- 9. The method of claim 8, wherein said profiling metric comprises an execution count.
- 10. The method of claim 8 or 9, wherein said profiling threshold comprises a selected maximum count.
  - 11. The method of any of claims 7 to 10, wherein the step of applying a selection algorithm comprises the steps of:

15

beginning group block formation wherein a trigger threshold is exceeded; and

selecting additional blocks for inclusion in the group 20 based on an inclusion threshold.

12. The method of claim 11, wherein group block formation is terminated when the number of blocks selected

for inclusion in the group block reaches a selected limit.

25

- 13. The method of claim 11 or 12, wherein additional blocks of the group are selected based on data identifying known successors of a preceding block.
- 30 14. The method of claim 13, wherein said data comprises references to basic block objects of predecessor and successor blocks, said references

comprising in the aggregate the control-flow graph of all previously executed basic blocks.

- 15. The method of any of claims 7 to 14, further including the step of performing an ordering traversal of said group of basic blocks to determine the order in which the blocks of said group will be translated.
- 16. The method of claim 15, wherein the ordering traversal is performed using an ordered depth-first search algorithm.
  - 17. The method of claim 16, wherein said algorithm orders the blocks based on execution count.
- 18. The method of claim 16 or 17, further comprising the step of performing global dead code elimination on said group block.
- 20 19. The method of claim 18, further comprising the step of global register allocation.
- 20. The method of any of claims 7 to 19, performed as part of a translation process including a control loop comprising the steps of:

executing translator code which translates a first block of subject code into translated code;

30 executing the translated code;

15

returning control to the translator code at the end of execution of the translated code; and

executing translator code which translates a second basic block into translated code.

- 5 21. The method of any of claims 7 to 20, further including the steps of register allocation and register synchronization.
- 22. The method of claim 21, wherein the step of register synchronization comprises performing a 10-case register synchronization algorithm.
- 23. The method of any preceding claim, wherein said steps are performed by program code stored on a tangible storage medium.
  - 24. A basic block data structure stored on a computer readable medium comprising:
- 20 a subject address;
  - a target code pointer;
  - a set of translation hints;

25

- a set of entry conditions;
- a set of exit conditions;
- 30 a profiling metric;

predecessor block data;

successor block data; and

an entry register map.

- 5 25. A method of translation of a plurality of basic blocks of program code, substantially as hereinbefore described with reference to the accompanying drawings.
- 26. A basic block data structure stored on a computer readable medium, substantially as hereinbefore described.

## ABSTRACT

## BLOCK TRANSLATION OPTIMIZATIONS FOR PROGRAM CODE CONVERSION

Subject program code 17 is translated to target code 21 in basic block units at run-time in a process wherein translation of basic blocks 23 is interleaved with execution of those translations. A combination of processes designed to enhance the speed and efficiency of run-time translation are applied based on characteristics of particular blocks and include translating a set of contiguous basic blocks prior to execution ("extended basic blocks") and grouping and ordering of frequently executed basic blocks for translation ("group blocking").

[Figure 1]

20

5

10

15

**第分扩展支持** 



F16. 1



FIG. #2



BASIC BLOCK DATA STRUCTURE & BASIC BLOCK CACHE

F16.3

TRANSCATE BB"A" 51 YE5 TRANSCATE PRUNE B CALC ADOR -59 PRUNG 1-1G 4 EXTENDED BASIC BLOCKS. EXECUTION SUBJECT FRANSLATION) 73 NO ADDRETS TRANSCATE No. CONDITION MAZZH MANSLAGE BB ARGET CODE

F/G. 5 ISOBLOCKING

GROUP BLOCKS ORDER BLOCKS GLOBAL DEAD CODE EZIMINATION GLOBAL REGISTER ALLOCATION TARGET

F16. 6

GENERATION



(

**FIG.** 7

